Advanced Geospatial Data Processing for Social Scientists
Dennis Abel & Stefan Jünger
2025-04-29
Now
Day
Time
Title
April 28
10:00-11:15
Introduction
April 28
11:15-11:30
Coffee Break
April 28
11:30-13:00
Raster data in R
April 28
13:00-14:00
Lunch Break
April 28
14:00-15:15
Raster data processing
April 28
15:15-15:30
Coffee Break
April 28
15:30-17:00
Graphical display of raster data in maps
April 29
10:00-11:15
Datacube processing I
April 29
11:15-11:30
Coffee Break
April 29
11:30-13:00
Datacube processing II & API access
April 29
13:00-14:00
Lunch Break
April 29
14:00-15:15
Data integration and linking (with survey data)
April 29
15:15-15:30
Coffee Break
April 29
15:30-17:00
Outlook and open session with own application
Datacubes - continued
Building on the introduction to data stacks and cubes in R, we now want to dig deeper into wrangling with these objects. We will cover more advanced raster operations and have a case study on accessing remote sensing data from an API. But first, we will briefly revise visualizing with ggplot - this time with our raster stacks/cubes.
Visualizing datacubes
Yesterday, you learned about visualizing SpatRaster objects with ggplot. Extending this approach to raster stacks and cubes is quite straightforward. We can create multiple graphs within one plot (“facets”) with ggplot which allows us to visualize two or more layers in one plot.
Many of you will know facet_wrap() or facet_grid() to separate standard graphs like scatterplots by a third grouping variable. The same logic applies to our third layer dimension (which in our case represents time but could potentially be any other band like RGB).
Visualizing terra raster stacks
We will start with our terraSpatRaster and load the raster stack created in the previous session.
class : SpatRaster
dimensions : 1287, 1168, 4 (nrow, ncol, nlyr)
resolution : 828.2603, 828.2603 (x, y)
extent : -415531, 551877.1, -607609.6, 458361.4 (xmin, xmax, ymin, ymax)
coord. ref. : NAD83 / California Albers (EPSG:3310)
source : CA_pop_stack.tif
names : pop_2017, pop_2018, pop_2019, pop_2020
min values : 0.0, 0.00, 0.00, 0.00
max values : 22052.6, 22123.96, 22208.66, 22285.26
unit : count, count, count, count
time (years): 2017 to 2020
Visualizing terra raster stacks
You know the drill - ggplot() initializes our blank canvas which we fill with our geoms. tidyterra'sgeom_spatraster() is used for plotting SpatRaster objects. With your knowledge on indexing single layers within a raster stack, we can create a single plot like yesterday.
For visualizing all four layers simultaneously, we can utilize the facet_wrap() command. Instead of providing the name of a third variable to the command, we call lyr, which geom_spatraster() recognizes as the third dimension.
Creating custom labels for your facets can generally be quite painful (not just for geodata). By default, it uses the values from the names of the SpatRaster attribute. If you want to change labels, you either change the names values (which we often don’t want to) or create a workaround. For example, here we would like to use the time values.
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
population [count] 0 0 0 1.894517 0.1030732 1906.748 71328
dimension(s):
from to offset delta refsys x/y
x 1 1171 -415531 826.2 NAD83 / California Albers [x]
y 1 1279 458361 -826.2 NAD83 / California Albers [y]
time 1 4 2017 1 NA
Visualizing stars raster cubes
Base R plotting of stars objects is possible. However, the default color-scheme is limited and often results in plots like this one which are relatively uninformative. It would require manual workarounds on breaks to adjust this. We prefer to directly move on to ggplot.
plot(CA_pop_cube[,,,1])
Visualizing stars raster cubes
Setting up your visualization with ggplot is very similar to the approach before for SpatRaster data. geom_stars allows us to directly input a stars object.
With facet_wrap, we can create a combination of several layers into one plot. Given that stars objects can have >1 bands, we specify it with the exact name. Here it is “time”. Facet labels are directly derived from that band. We do not need to wrangle with labeling further.
Now that we have established an understanding of raster stacks and cubes in R, we would like to introduce a case study in nighttime lights. We will briefly introduce:
Remote sensing and how nighttime lights data is generated,
how we can access that data via a public API and,
how to wrangle with that data.
In particular, we will focus on working with and “harmonizing” raster layers with different spatial properties.
R package blackmarbler
blackmarbleR by Robert Marty and Gabriel Stefanini Vicente (2025) supports easy access to NASA’s Black Marble API. Let’s check out their vignette to set up an account and the data retrieval.
library(blackmarbler)
Data retrieval
The function bm_raster() to retrieve the nighttime lights requires as input an sfobject to determine the spatial extent of the downloaded data. The object must be in WGS84.
We will focus on California. Let’s load US states shapefiles with the tigris package and subset to California.
Simple feature collection with 1 feature and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -124.482 ymin: 32.52951 xmax: -114.1312 ymax: 42.0095
Geodetic CRS: WGS 84
REGION DIVISION STATEFP STATENS GEOID GEOIDFQ STUSPS NAME LSAD
1 4 9 06 01779778 06 0400000US06 CA California 00
MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON
1 G4000 A 403673433805 20291632828 +37.1551773 -119.5434183
geometry
1 MULTIPOLYGON (((-119.9999 4...
Data retrieval
Once you have set up your profile at NASA’s Earth Data Portal and generated your API token, you can assign it to an object in R for the data retrieval.
NOTE: I stored mine as an environment variable in the system so that I don’t hard code it into these slides. If you work locally on your own laptop, this is not necessary.
bearer <-Sys.getenv("NASA-token")# If you work locally, directly assign it# bearer <- "YOUR_TOKEN"
Data retrieval
We are ready to download the data from NASA’s API.
CA_nl_stack <-bm_raster(roi_sf = CA_sf,product_id ="VNP46A4", # for yearly datadate =2017:2020, # same four years like our population databearer = bearer, # your API tokenoutput_location_type ="file", # we want to store geotiff on diskfile_dir ="./data/", # where to store geotifffile_return_null =FALSE# also create SpatRaster file)
Data retrieval
By default, the function writes the data to the R environment (output_location_type = "memory"). If you want to store it as single GeoTIFFs, specify output_location_type = "file and the file path with file_dir=. file_return_null= further specifies whether the data is additionally loaded to the R environment.
In the previous step, we wrote the data to the disk. Let’s load a single layer for 2020 to inspect the data a bit further.
We want to utilize this NL data to explore a bit further how to create and process raster stacks. This time, we will focus on situations with more than one variable and on cases where the spatial properties of the different layers do not match. We will work with night lights and population data in 2020.
We will zoom to Los Angeles County for this exercise.
# Load LA County fileLA_county <- tigris::counties("CA", progress_bar =FALSE ) |>filter(NAME =="Los Angeles") |>st_transform(crs ="EPSG:4326")# Subset to "mainland" California and # exclude the two islands Santa Catalina and San ClementeLA_county <- LA_county %>%st_cast("POLYGON") %>%mutate(area =st_area(.)) %>%slice_max(area, n =1)
Simple feature collection with 1 feature and 19 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -118.9517 ymin: 33.65955 xmax: -117.6464 ymax: 34.8233
Geodetic CRS: WGS 84
STATEFP COUNTYFP COUNTYNS GEOID GEOIDFQ NAME
1.2 06 037 00277283 06037 0500000US06037 Los Angeles
NAMELSAD LSAD CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT
1.2 Los Angeles County 06 H1 G4020 348 31080 31084 A
ALAND AWATER INTPTLAT INTPTLON area
1.2 10516008433 1784982940 +34.1963983 -118.2618616 10863687022 [m^2]
geometry
1.2 POLYGON ((-118.0288 33.8733...
Layers with different properties
Having our spatial sf file properly set up, we can now prep our two raster files.
# Create LA raster files for population and night lights in 2020LA_pop_2020 <- terra::rast("./data/US-CA_ppp_2020_1km.tif") |> terra::mask(terra::vect(LA_county)) |> terra::crop(LA_county)LA_pop_2020[] <-log(LA_pop_2020[] +1)LA_nl_2020 <- terra::mask(CA_nl_2020, terra::vect(LA_county)) |> terra::crop(LA_county)
Layers with different properties
Visually, there is a strong positive relationship between nighttime lights and population in LA.
plot(LA_pop_2020)
plot(LA_nl_2020)
Layers with different properties
Our two raster files have the same CRS and (almost same) spatial extent. Unfortunately, the resolution (cell size) differs. Our population data is on an approx. 1km grid and our night lights data on an approx. 500m grid.
print(LA_pop_2020)
class : SpatRaster
dimensions : 140, 157, 1 (nrow, ncol, nlyr)
resolution : 0.008333333, 0.008333333 (x, y)
extent : -118.9512, -117.6429, 33.65792, 34.82458 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
varname : US-CA_ppp_2020_1km
name : US-CA_ppp_2020_1km
min value : 0.000000
max value : 9.843707
print(LA_nl_2020)
class : SpatRaster
dimensions : 280, 313, 1 (nrow, ncol, nlyr)
resolution : 0.004166667, 0.004166667 (x, y)
extent : -118.95, -117.6458, 33.65833, 34.825 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source(s) : memory
varname : VNP46A4_NearNadir_Composite_Snow_Free_qflag_t2020
name : t2020
min value : 0.000000
max value : 4.400603
Layers with different properties
Most simple solution to harmonize: If cell sizes are multiples (which is the case in our example), we can use terra::disagg() or terra::aggregate() to adjust resolution of one layer to resolution of another.
First option: Increase resolution for population data
# Increase resolution for population dataLA_pop_2020_high <- terra::disagg(LA_pop_2020,fact =c(2, 2),method ="bilinear")# There is still a slight mismatch due to rounding errors (one more ncol)# Let's crop to the spatial extent of the nightlights dataLA_pop_2020_high <-crop( LA_pop_2020_high, terra::ext(LA_nl_2020))
Layers with different properties
Most simple solution to harmonize: If cell sizes are multiples (which is the case in our example), we can use terra::disagg() or terra::aggregate() to adjust resolution of one layer to resolution of another.
First option: Increase resolution for population data
# A small rounding error in extent will prohibit to concatenate # into a stack. We now force the extentext(LA_pop_2020_high) <-ext(LA_nl_2020)
Layers with different properties
Most simple solution to harmonize: If cell sizes are multiples (which is the case in our example), we can use terra::disagg() or terra::aggregate() to adjust resolution of one layer to resolution of another.
Second option: Decrease resolution for nightlights data
# Decrease resolution for nightlights dataLA_nl_2020_low <- terra::aggregate(LA_nl_2020,fact =c(2, 2),method ="bilinear")# There is still a slight mismatch due to rounding errors (one more ncol)# Let's crop to the spatial extent of the population dataLA_nl_2020_low <-crop( LA_nl_2020_low, terra::ext(LA_pop_2020))
Layers with different properties
Most simple solution to harmonize: If cell sizes are multiples (which is the case in our example), we can use terra::disagg() or terra::aggregate() to adjust resolution of one layer to resolution of another.
Second option: Decrease resolution for nightlights data
More complex: Imputation logics. Previous examples follow the idea of interpolating existing data across the spatial domain. Imputation fills in missing values based on a prediction model. Let’s consider our two variables to make up a stylized example:
We know that population density and nightlights is correlated. We could try to predict the missing values for population based on the values of nightlights to generate the higher resolution population data. In order to do that, we train a RandomForest model on the low resolution data of population and nightlights.
library(randomForest)
Layers with different properties
There is some data prepping to do…
# Covariates need to be in same size as outcome variable = 1kmLA_nl_2020_resampled <-resample(x = LA_nl_2020,y = LA_pop_2020,method ="bilinear")# Create training data - one row per celltrain_data <-as.data.frame(LA_nl_2020_resampled, xy =TRUE, cells =TRUE,na.rm =FALSE) |>left_join(as.data.frame(LA_pop_2020, xy=FALSE, cells=TRUE, na.rm=FALSE),by ="cell") |>rename(nightlights = t2020,population =`US-CA_ppp_2020_1km`)train_data <-na.omit(train_data)
Layers with different properties
Ready to fit the model and predict population data on 500m grid.
# Fit modelout <-randomForest( population ~ nightlights,data = train_data,ntree =500)# Predict on the 500m grid# Covariate names need to matchnames(LA_stack)
Of course this is a simplified approach. What other factors might correlate with population data (and nightlights) and might improve the validity of the imputed cells?
terra::plot(LA_stack)
Stack-based descriptive statistics
# Global univariate meansglobal(LA_stack, fun = mean, na.rm=TRUE)
mean
population 2.9236307
nightlights 0.5277355
population_predicted 2.7963083
# Bivariate correlationslayerCor(LA_stack, fun ="cor", use ="complete.obs")
$correlation
population nightlights population_predicted
population 1.0000000 0.8245851 0.8757795
nightlights 0.8245851 1.0000000 0.8728175
population_predicted 0.8757795 0.8728175 1.0000000
$mean
population nightlights population_predicted
population NaN 2.923631 2.9236307
nightlights 0.5277355 NaN 0.5277355
population_predicted 2.7963083 2.796308 NaN
$n
population nightlights population_predicted
population NaN 59392 59392
nightlights 59256 NaN 59256
population_predicted 59256 59256 NaN